ABSTRACT

Three local buses and three composite operation buses are provided in each processing element. An arithmetic logic unit, a multiplier, a bit operator, and an accumulator are connected to respective local buses and the composite operation buses. As a result, each operation unit can transfer data efficiently using a plurality of buses of different functions.
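
By way of illustration only, the data path summarized above can be sketched in Python. The sketch models one composite operation, the sum of squared differences discussed for the first embodiment, in which the arithmetic logic unit, the multiplier, the bit operator, and the accumulator hand results to one another over composite operation buses MOB0-MOB2 without passing through the local buses. The class name, the function names, and the `shift` parameter are assumptions made for this example and do not appear in the specification; the cycle-level timing of the hardware is not modeled.

```python
from dataclasses import dataclass


@dataclass
class ProcessingElement:
    """Toy model of one processing element; not the actual circuit."""

    r12: int = 0  # stands in for the accumulator's output register R12

    def ssd_step(self, a: int, b: int, shift: int = 0) -> int:
        """Accumulate one (a - b)**2 term, scaled by an arithmetic right shift."""
        mob0 = a - b          # ALU subtracts; result goes onto MOB0
        mob1 = mob0 * mob0    # multiplier takes MOB0 on both inputs; result onto MOB1
        mob2 = mob1 >> shift  # bit operator shifts to align digits; result onto MOB2
        self.r12 += mob2      # accumulator adds MOB2 to its previous output
        return self.r12


def simd_ssd(rows_a, rows_b, shift=0):
    """Run the same step on every element, one row pair per processing element."""
    if not rows_a:
        return []
    pes = [ProcessingElement() for _ in rows_a]
    for column in range(len(rows_a[0])):
        # Every processing element executes the same instruction on its own data.
        for pe, row_a, row_b in zip(pes, rows_a, rows_b):
            pe.ssd_step(row_a[column], row_b[column], shift)
    return [pe.r12 for pe in pes]
```

Calling `simd_ssd` with two lists of equal-length rows returns one accumulated sum per modeled processing element, which mirrors the way each element repeats the same composite operation on its own data.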
SIMD PROCESSOR OPERATING WITH A PLURALITY OF PARALLEL PROCESSING ELEMENTS IN SYNCHRONIZATION

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to an SIMD processor that operates with a plurality of parallel processing elements in synchronization, and that is controlled with a single instruction indicated by a unitary program counter. More particularly, the present invention relates to an SIMD processor suitable for image processing.

2. Description of the Background Art

A conventional image processing SIMD processor will be described hereinafter with reference to a block diagram of FIG. 22 showing a structure thereof.

An SIMD processor includes a control unit 100 for carrying out the overall control in a programmable manner, a memory unit 101 for storing load coefficients and template data, a data unit 102 including shift registers 121a-121e for transferring image data, a processor unit 103 having a plurality of processing elements (PE) 131a-131d arranged in parallel including an arithmetic logic unit (ALU) 132 and a multiplier (MPY) 133, a linkage unit 104 including arithmetic logic units 141 and 142, and an evaluation unit 105 formed of a comparator.

Each of processing elements 131a-131d in processor unit 103 carries out the same process on different data transferred in parallel from data unit 102 or memory unit 101 according to a control signal from control unit 100. The calculation result of each of processing elements 131a-131d is transferred to linkage unit 104, whereby an integration process is carried out between processing elements 131a-131d. For [...] multiplication of a pixel in each horizontal direction and a load coefficient in a local window is carried out in parallel by each of processing elements 131a-131d. The results are added in linkage unit 104.

An image process includes a variety of processes such as operation between images, measurement of an image area or of the center point, and pattern matching, in addition to the above filtering process. Most of the processes can be carried out by a single operation process with respect to a great amount of data. It is therefore effective to carry out various algorithms by modifying the program using an SIMD processor from the standpoint of saving the hardware.

Although a conventional SIMD processor of the above-described structure can carry out a filtering process, operations specific to image processing such as product-sum operation and bit operation cannot be carried out due to the inadequacy of the processing element function.

Furthermore, a conventional SIMD processor required a great amount of processing steps. The operated results of the processing elements can be integrated only through a linkage unit. The integrated result [...] processing element. There was the problem that a process using the integrated result cannot be effected. Thus, even though a conventional SIMD processor is programmable, it had various problems such as low processing speed and insufficient functions. It was effective only for a partial processing of the image processing field, and its range of application was extremely limited.

SUMMARY OF THE INVENTION

An object of the present invention is to provide an SIMD processor of a wide application range that can carry out various processes.

Another object of the present invention is to provide an SIMD processor that can carry out a process at high speed.

A further object of the present invention is to provide an SIMD processor that can transfer data between processing elements at high speed.

Still another object of the present invention is to provide an SIMD processor that can have circuit complexity reduced.

A still further object of the present invention is to provide an SIMD processor that can have the number of control buses reduced.

Yet a further object of the present invention is to provide an SIMD processor that can have instruction description simplified.

An SIMD processor according to an aspect of the present invention includes an overall control unit, a plurality of processing elements, a global bus for connecting unidimensionally each of the plurality of processing elements in parallel, and a control bus for connecting the control unit with each of the plurality of processing elements. Each of the plurality of processing elements includes a local memory, a plurality of operation units, a data input/output unit, three local buses connected to the local memory, the plurality of operation units, and the data input/output unit for transferring data, and a composite operation bus connected to each operation unit for transferring data to carry out a composite operation. The overall control unit controls the operation of each of the plurality of processing elements to carry out the same operation.

According to the above structure, data can be transmitted to each of the three local buses and to the composite operation bus in the SIMD processor. Therefore, the data [...] efficiency between processing elements [...]

An SIMD processor according to another aspect of the present invention includes an overall control unit, a plurality of processing elements each having a local memory, and a global bus for connecting unidimensionally each of the plurality of processing elements in parallel. The global bus includes a first global bus for transmitting an output data of the overall control unit to each of the plurality of processing elements, a second global bus for providing each output data of the plurality of processing elements to each of the plurality of processing elements, and a third global bus for providing the data of a local memory of one of the plurality of processing elements to another processing element.

According to the above structure, the provision of three buses of different functions in an SIMD processor allows data transfer between processing elements in a flexible and speedy manner.

An SIMD processor according to a further aspect of the present invention includes eight processing elements [...], and eight global buses connecting the eight processing elements logically at equal intervals. Each of the processing elements receives data from four predetermined global buses out of the eight global buses, and [...] predetermined global buses out of the four global buses.

According to the above structure, the connection of the processing elements logically at equal intervals in an SIMD processor allows data transfer between processing elements at high speed and also with respect to a combination of various processing elements.

An SIMD processor according to still another aspect of the present invention includes a plurality of processing
elements. Each of the plurality of processing elements includes a local memory, and an input unit for selectively providing an output data of the local memory of a processing element to the local memory of an adjacent processing element. The local memories are connected in series.

According to the above structure, the connection of the local memories in each processing element in series in a chain-like manner allows the local memory to function as a line memory effective for image processing. Since an internal local memory can be used individually or in a series-connected manner, it can be made to function as a line memory effective for image processing, and also allows external data independent of each local memory to be input. As a result, data transfer can be carried out speedily. High speed processing is allowed by operating each processing element in parallel. The SIMD processor can be utilized in a wide range of applications.

An SIMD processor according to a still further aspect of the present invention includes a plurality of processing elements, wherein each of the processing elements has a local memory. The local memory includes three bank memories, each controllable individually.

According to the above structure, two bank memories can carry out a readout operation while one bank memory carries out a writing operation at the same time. Therefore, processing at a high speed is possible.

An SIMD processor according to yet a further aspect of the present invention includes an overall control unit, a plurality of processing elements, a global bus connecting unidimensionally each of the plurality of processing elements in parallel, and a control bus for connecting the overall control unit with each of the plurality of processing elements. Each processing element includes a plurality of operation units each operating in response to a control signal, a decoder for decoding an operation code transmitted from the overall control unit via a control bus to provide a plurality of control signals and a pipeline delay signal corresponding to the plurality of operation units, and a plurality of pipeline registers provided corresponding to each of the plurality of control signals, receiving a corresponding control signal out of the plurality of control signals, and responsive to a pipeline delay signal corresponding to a plurality of pipeline delay signals for providing a control signal at a predetermined timing.

According to the above structure, the control signal output from the overall control unit includes only an operation code. Therefore, the number of control buses can be reduced.

An SIMD processor according to yet another aspect of the present invention includes an overall control unit, a plurality of processing elements, a global bus connecting unidimensionally each of the plurality of processing elements in parallel, and a control bus connecting the overall control unit with each of the plurality of processing elements. Each processing element includes a plurality of operation units each operating in response to a control signal, a comparator for comparing a flag corresponding to an operation result output from each of the plurality of operation units with a condition determination code applied from the overall control unit via a control bus, and a mask unit responsive to the comparison result of the comparator for masking a control signal output corresponding to each of the plurality of operation units from the overall control unit via the control bus, and providing the masked control signal.

According to the above structure, a plurality of processing elements operating in parallel according to the same control signal can be selectively operated according to an operation result. Therefore, processes of greater variety can be carried out.

An SIMD processor according to yet a still further aspect of the present invention includes an overall control unit, a plurality of processing elements having a plurality of operation units each operating in response to a control signal, a global bus connecting unidimensionally the overall control unit with each of the plurality of processing elements in parallel, and a control bus connecting the overall control unit with each of the plurality of processing elements. The overall control unit includes a pipeline unit for delaying a control signal corresponding to each of the plurality of control units via a pipeline. The pipeline unit provides a plurality of pipeline-delay values required for pipe-insertion and a pipeline-delay control signal to each of the plurality of processing elements via a control bus. Each processing element further includes a comparator for comparing a flag corresponding to each operation result output from the plurality of operation units with a condition determination code provided from the overall control unit via a control bus, and a mask unit responsive to a plurality of pipeline-delay values and a comparison result of the comparator for masking a signal output from the [...] the masked control signal to the corresponding plurality of operation units.

According to the above structure, instruction description of a control signal is facilitated in an SIMD processor. A code determination instruction can be described at an arbitrary position. Therefore, the number of instruction steps can be reduced to improve the processing rate.

An SIMD processor according to another aspect of the present invention includes a plurality of processing elements, a link processing unit, and a global bus connecting unidimensionally each of the plurality of processing elements and the link processing unit. The link processing unit includes at least an arithmetic logic unit capable of addition and detection of a maximum value/minimum value, and a local memory for storing data.

According to the above structure, accumulation and sorting of the outputs of the processing elements can be carried out without having to transfer data between each of the processing elements. Therefore, the processing operation by each processing element is speeded, and the integration function of data between the processing elements is improved.

An SIMD processor according to a further aspect of the present invention includes eight processing elements, a link processing unit, and a global bus connecting unidimensionally each of four processing elements and the link processing unit. The link processing unit includes eight divide units dividing each output data of the eight processing units into upper data and lower data and selecting and providing either the upper level data or the lower level data. The link processing unit combines the eight output data from the eight divide units for providing four output data.

According to the above structure, the outputs of the processing elements can be provided outside the SIMD processor in various modes. Data can be output with a reduced number of external output lines with respect to eight parallel outputs.

An SIMD processor according to a further aspect of the present invention includes a plurality of processing elements, a link processing unit, and a global bus connecting unidimensionally each of the plurality of processing elements and the link processing unit. The link processing unit
includes a sorting unit for sorting a plurality of data applied from the plurality of processing elements to the global bus, and a code allocation unit for allocating a predetermined code to each of the plurality of data sorted by the sorting unit.

According to the above structure, it is not necessary to generate a code by the processing element. Therefore, the circuit complexity can be reduced. Furthermore, the process is speeded since code allocation and sorting are carried out in parallel with the operation of the processing elements.

The foregoing and other objects, features, aspects and advantages of the present invention will become more apparent from the following detailed description of the present invention when taken in conjunction with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram showing a structure of an SIMD processor according to a first embodiment of the present invention.

[...]

FIG. 5 shows a structure of the data input/output unit of FIG. 4.

FIG. 6 shows the relationship between a global bus input/output control signal and input/output selection of the data input/output unit.

FIG. 7 is a block diagram showing a structure of the main components of an SIMD processor according to a fourth embodiment of the present invention.

FIG. 8 is a diagram for describing local processing of filtering.

FIGS. 9 and 10 are block diagrams showing a structure of the main components of an SIMD processor according to fifth and sixth embodiments, respectively, of the present invention.

FIG. 11 is a block diagram showing a structure of a PE operation control unit of FIG. 10.

FIG. 12 shows the relationship between instruction and control of the SIMD processor of FIG. 11.

FIGS. 13 and 14 are block diagrams showing a structure of an SIMD processor according to seventh and eighth embodiments, respectively, of the present invention.

FIG. 15 is a block diagram showing a structure of a pipeline register of FIG. 14.

FIG. 16 is a diagram for describing an operation of the pipeline register of FIG. 15.

FIG. 17 is a block diagram showing a structure of an SIMD processor according to a ninth embodiment of the present invention.

FIG. 18 is a diagram for describing a sorting process.

FIG. 19 is a block diagram showing a structure of an SIMD processor according to a tenth embodiment of the present invention.

FIG. 20 is a block diagram showing a structure of the interface unit of FIG. 19.

FIG. 21 is a block diagram showing a structure of an SIMD processor according to an eleventh embodiment of the present invention.

FIG. 22 is a block diagram showing a structure of a conventional SIMD processor.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

First Embodiment

Referring to FIG. 1, an SIMD processor includes an overall control unit (CU) CUa for controlling each processing element, a plurality of processing elements (PE) PEa0-PEan, a global bus GB for connecting unidimensionally each of processing elements PEa0-PEan in parallel, and a control bus CB for connecting overall control unit CUa and each of processing elements PEa0-PEan.

Each of processing elements PEa0-PEan includes a local memory (LM) LMa, a data input/output unit (IO) IOa, an ALU block ALB, an MPY block MB, a BMU block BB, an AU block AUB, local buses LB0-LB2, and composite operation buses MOB0-MOB2.

ALU block ALB includes registers (REG) R1-R3, a selector (SEL) S1, and an arithmetic logic unit (ALU) AL. [...]

Processing elements PEa0-PEan are controlled by a control signal output from overall control unit CUa via control bus CB. Each processing element carries out the same process, and data transfer between the processing elements is carried out through global bus GB.

ALU block ALB, MPY block MB, BMU block BB, and AU block AUB in a processing element are connected by local buses LB0-LB2.

Local memory LMa is controlled by a control signal provided via control bus CB to store data read out from local bus LB2. The data read out from local memory LMa is provided to local buses LB0 and LB1.

Data input/output unit IOa controls data input/output when data in each of processing elements PEa0-PEan is to be transferred to and from another processing element.

Arithmetic logic unit AL carries out an arithmetic operation such as addition, subtraction, and absolute value, or a logical operation such as inclusive OR, AND, and exclusive OR on two inputs with the data transferred from local buses LB0 and LB1 via registers R1 and R2 as the source according to a control signal provided from overall control unit CUa. The result of arithmetic logic unit AL is provided to register R3, MPY block MB, BMU block BB, and AU block AUB via composite operation bus MOB0.

Selector S2 receives data of local bus LB0 and data of composite operation bus MOB0 and provides either data to register R5. Selector S3 selectively provides data of either local bus LB1 or composite operation bus MOB0 to register R6. Multiplier MP receives data of registers R5 and R6. More specifically, multiplier MP receives either output of local bus LB0 or arithmetic logic unit AL at its first input and an output of either local bus LB1 or arithmetic logic unit AL at its second input to carry out multiplication therebetween. The result of multiplier MP is provided to register R7, and to BMU block BB and AU block AUB via composite operation bus MOB1.

Selector S5 receives data of local bus LB0 and composite operation bus MOB1 and selectively provides either to register R8. Similarly, selector S6 provides data of either local data bus LB1 or composite operation bus MOB1 to
register R9. Bit operator BM receives data of registers R8 and R9. More specifically, bit operator BM receives either output of local data bus LB0 or multiplier MP at its first input, and an output of either local bus LB1 or arithmetic logic unit AL at its second input, and carries out an operation mainly classified into two types as set forth in the following on the two input data.

FIG. 2 is a block diagram showing a structure of the bit operator of FIG. 1. Referring to FIG. 2, bit operator BM includes a logic unit BM1, a bit counter BM2, a shifter BM3, and a selector S11.

The first operation of bit operator BM is a shifting operation carried out by logic unit BM1 and shifter BM3. More specifically, logic shifting and arithmetic shifting are carried out on the inclusive ORed or ANDed result of first input BS1 provided from register R8 and second input BS2 provided from register R9, or on one of the two inputs BS1 and BS2.

The second operation is to count the number of "1"s in second input BS2 by bit counter BM2. One of the first and second operation results is selected by selector S11 to be provided as output BS3. Output BS3 is provided to register R10 and to AU block AUB via composite operation bus MOB2.

Referring to FIG. 1 again, selector S8 receives each data of arithmetic logic unit AL, multiplier MP, and bit operator BM via composite operation buses MOB0-MOB2. One of the three input data is provided to register R11. Accumulator AU is applied with data of register R11 and its own output via register R12. Accumulator AU sets the data selected by selector S8 or the accumulated data of the selected data and the data of register R12 to register R11.

Each data of the output side registers R3, R7, R10, and R12 of each operation unit is selectively provided to local buses LB0-LB2 via selectors S1, S4, S7 and S9, respectively.

In a general operation, one result is obtained on the basis of two data inputs. In the present embodiment, local buses LB0 and LB1 are used as two input data buses. The operation result is written into local memory LMa. Local bus LB2 is used as an output data bus for data exchange between each processing element. By providing three local buses LB0-LB2 within a processing element, required data transfer can be carried out individually at the same time with respect to one operation instruction. Therefore, high speed processing is allowed.

In a composite operation such as a sum of squared difference where processing must be carried out sequentially through a plurality of operation units, data transfer is carried out via composite operation buses MOB0-MOB2 directly connecting each of the operation units. Data is not transferred via the output side registers R3, R7, R10 of each operation unit and local buses LB0-LB2. In the processing elements of the present embodiment, the same sum of squared difference operation can be carried out for every machine cycle. More specifically, in the case of a sum of squared difference operation, two data read out simultaneously from local memory LMa are applied to arithmetic logic unit AL via local buses LB0 and LB1, whereby subtraction of the inputs is carried out. The subtracted result is provided via composite operation bus MOB0 as the two inputs of multiplier MP. Multiplier MP carries out a square operation according to the two inputs. The square operation result is provided via composite operation bus MOB1 as an input of bit operator BM. The input data directly passes through logic unit BM1, and an arithmetic shifting is carried out by shifter BM3 for arranging the digit positions. The shifted result is transmitted as an input of accumulator AU via composite operation bus MOB2. Accumulator AU carries out an addition calculation of the data of register R12 and the input. By the above-described processes, arithmetic logic unit AL always carries out subtraction, multiplier MP always carries out multiplication, bit operator BM always carries out arithmetic shifting and accumulator AU always carries out accumulation. Therefore, execution of a sum of squared difference operation is repeated continuously without having to insert a data transfer instruction therebetween.

Other composite operations include a sum of the absolute [...], or accumulation of the number of "1"s in data after a masking operation on a local memory data that is frequently carried out on binary images. These composite operations can also be carried out at high speed as described above. As to operation units that are not used in the composite operation process during a composite operation execution, or as to operation units that have already finished their operation in a composite operation process, a process belonging to a different operation instruction can be carried out simultaneously with the composite operation instruction without having to wait for completion of a composite operation process.

The output of each operation unit is stored in the output side registers R3, R7, R10 and R12, and then transferred to local memory LMa or to data input/output unit IOa via local bus LB2. It is therefore possible to use registers R3, R7, R10 and R12 as primary registers. Thus, data of each of registers R3, R7, R10 and R12, or data written at the same time local memory LMa is written, can become the source of the next instruction without passing through local memory LMa.

In the SIMD processor of the first embodiment including a plurality of processing elements arranged in parallel, there are provided functional blocks such as a local memory, a data input/output unit, an arithmetic logic unit, a multiplier, a bit operator, and an accumulator. [...] two inputs and one output. The output of each operator is applied to [...] of the input side or to a selector of the [...] side of another operation unit. An output of an output register is selectively provided to a local bus. Therefore, the data transmission efficiency within a processing element is improved. Various operations can be carried out at high speed.

Second Embodiment

Referring to FIG. 3, an SIMD processor includes an overall control unit CUa, and a plurality of processing elements PEb0-PEbn. Each of processing elements PEb0-PEbn includes a data input/output unit IOb, a local memory LMb, and local buses LB0-LB2. Similar to processing elements PEa0-PEan of FIG. 1, each of processing elements PEb0-PEbn of FIG. 3 includes an ALU block ALB, an MPY block MB, a BMU block BB, and an AU block AUB which are not illustrated for the sake of simplification. The operations of these blocks are similar to those of processing elements PEa0-PEan shown in FIG. 1, and their detailed description will not be repeated.

The characteristic features of the second embodiment will be described hereinafter. Data read out from local memory LMb is provided to local buses LB0 and LB1 as well as to data input/output unit IOb. Data input/output unit IOb is connected to the outside world via a global bus GIB for transmitting data provided from overall control unit CUa, a global bus GPB for exchanging data between processing elements PEb0-PEbn, and a global bus GMB for providing data applied to data input/output unit IOb from local memory LMb.
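
As a rough illustration of the three global buses of the second embodiment, the following Python sketch models each bus as a function over a list of per-element values. The function names, the list-based representation, the direction of the exchange, and the wrap-around at the ends of the array in `gpb_exchange` are assumptions made for the example, not details taken from the specification.

```python
def gib_broadcast(value, num_pes):
    """GIB: the overall control unit sends one value to every processing element."""
    return [value] * num_pes


def gpb_exchange(outputs, distance):
    """GPB: each element drives its own bus line and reads the line of the
    element `distance` positions further along (wrap-around is an assumption)."""
    n = len(outputs)
    return [outputs[(i + distance) % n] for i in range(n)]


def gmb_broadcast(local_memories, source_pe, address):
    """GMB: data read from one element's local memory is offered to all elements."""
    value = local_memories[source_pe][address]
    return [value] * len(local_memories)
```

In this model, `gpb_exchange` with a fixed `distance` corresponds to a transfer between processing elements disposed at constant intervals, while `gmb_broadcast` corresponds to distributing local memory data of one arbitrary processing element.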
[...] data included in an instruction applied to overall control unit CUa or register data in overall control unit CUa is transferred via global bus GIB. Data transfer is carried out via global bus GIB when operation is carried out on a known common data with an operation instruction or when a common data is to be set in local memory LMb and in an output register of an operation unit (not shown).

The number of global buses GPB corresponds to the number of processing elements PEb0-PEbn. Data transfer can be carried out simultaneously between processing elements PEb0-PEbn. Data transfer is carried out via global bus GPB when the total sum of the operation results of parallel-connected processing elements PEb0-PEbn is to be obtained, or when data is to be transferred between processing elements PEb0-PEbn disposed at constant intervals.

Global bus GMB is used when data of local memory LMb in an arbitrary one of the parallel-connected processing elements PEb0-PEbn is to be transferred to all the other processing elements.

In the present second embodiment, three types of global buses are provided for data transfer, including a bus GIB for distributing data from overall control unit CUa to all processing elements PEb0-PEbn, a bus GPB connected to output local bus LB2 in all processing elements PEb0-PEbn via data input/output unit IOb, and a bus GMB for providing data read out from local memory LMb of one of processing elements PEb0-PEbn to all the other processing elements. Thus, data exchange between each processing element can be carried out flexibly and speedily. Since data transfer is allowed via a plurality of global buses, various operations can be carried out at a higher speed. Thus, various processings can be executed.

Third Embodiment

Referring to FIG. 4, an SIMD processor includes eight processing elements PEc0-PEc7, and eight global buses GP0-GP7. Each of processing elements PEc0-PEc7 includes data input/output units IOc0-IOc3. Although an overall control unit, a control bus, a local memory, an MPY block, a BMU block and an AU block are not illustrated in FIG. 4 for the sake of simplification, the operation of each unit is similar to that shown in FIG. 1. Therefore, their description will not be repeated.

Data input/output units IOc0, IOc1, IOc2, and IOc3 are provided as the interface of respective global buses GP0-GP7 in processing elements PEc0 and PEc4, PEc1 and PEc5, PEc2 and PEc6, and PEc3 and PEc7, respectively. Each of data input/output units IOc0-IOc3 includes two output ports d0 and d1 as the data output portion, and four input ports s0-s3 as the input portion of a processing element. Each port is connected to a predetermined one of global buses GP0-GP7. More specifically, the outputs from data input/output unit IOc0 of processing elements PEc0 and PEc4 are connected to global buses GP0 and GP4. The outputs from data input/output unit IOc1 of processing elements PEc1 and PEc5 are connected to global buses GP1 and GP5. The outputs from data input/output unit IOc2 of processing elements PEc2 and PEc6 are connected to global buses GP2 and GP6. The outputs from data input/output unit IOc3 of processing elements PEc3 and PEc7 are connected

The data input/output unit of FIG. 4 will be described in detail with reference to FIG. 5.

Referring to FIG. 5, data input/output unit IOc0 includes determination unit I1, selectors S21 and S22, and bus drivers I2 and I3.

The data on local bus LB2 is selectively provided to either output d0 or d1 by selector S21. One data out of inputs s0-s3 is selected by selector S22. The selected data is [...] bus drivers I2 and I3, and provided to local bus LB0 or LB1 according to a control signal applied via control bus CB. The selections of selectors S21 and S22 are determined by a control signal output from determination unit I1 on the basis of a control signal SG applied via control bus CB.

The structure of each of data input/output units IOc0-IOc3 is substantially similar, provided that the function of determination unit I1 determining the selection of data input/output between global buses GP0-GP7 differs. A 3-bit global bus input/output control signal SG is applied to determination unit I1 from control bus CB. Global bus input/output control signal SG is applied in common to all processing elements. The specification of an input/output select signal between the global buses differs for every data input/output unit.

The relationship between global bus input/output control signal SG and selection of input/output of a data input/output unit will be described hereinafter with reference to FIG. 6.

Global bus input/output control signal SG represents the distance between processing elements that exchange data. For example, SG="011" implies that a processing element receives data from the third rightward processing element. That is to say, processing element PEc0 receives data from processing element PEc3, and processing element PEc1 receives data from processing element PEc4.

Data input/output unit IOc0 provides data from output port d1 when control signal SG is 001-100, and otherwise from output port d0. Data input/output unit IOc1 provides data from output port d1 when global bus input/output control signal SG is 010-101, and otherwise from output port d0. Data input/output unit IOc2 provides data from output port d1 when global bus input/output control signal SG is 011-110, and otherwise from output port d0. Data input/output unit IOc3 provides data from output port d1 when global bus input/output control signal SG is 100-111, and otherwise from output port d0. Since the connection order of a global bus and a data input/output bus is shifted for every data input/output unit, the selection of an input is common for all data input/output units. More specifically, input ports s0, s1 and s2 are respectively selected when the less significant 2 bits of global bus input/output control signal SG are 00, 01, and 11, respectively.

In the SIMD processor of the third embodiment including 8 parallel processing elements, a 4-input and 2-output connection is provided with the global buses. The connection of a global bus with the input/output selection differs in each processing element. Therefore, data transfer between processing elements logically at equal intervals can be carried out depending upon a combination of the global bus connection and the input/output selection. Therefore, the
to global buses GP3 and GP7. Output ports dO and dl are number of lines of connection between each processing 
connected to different global buses. Data from global buses element and the global buses is reduced. Furthermore, the 
GP0-GP3 are applied to the four inputs of processing circuit complexity of each processing element is reduced, 
elements PEcO-PEc3. Data from global buses GP4-GP7 are Since each processing element can carry out data transfer 
applied to the inputs of processing elements FEc4-FEc7. 65 logically at equal intervals, the data transfer between pro- 
Each of input ports S0-S4 is connected to a corresponding cessing elements is speeded, and can be carried out with 
global buses shifted by one in order. respect to various ccairnnations of processing elements. 
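
As a rough illustration of the equal-interval transfer described for the third embodiment, the following Python sketch models only the logical effect of control signal SG: each processing element receives the value held by the element SG positions to its right. The function name `receive_by_distance` and the wrap-around at the right edge of the array are assumptions made for illustration; the text does not state how the end elements are handled, and the port-level selection (d0/d1, s0-s3) is not modeled.

```python
def receive_by_distance(outputs, sg):
    """Return what each element receives when SG encodes the distance `sg`.

    outputs[i] is the value driven by processing element i.
    Wrap-around at the array edge is an assumption, not stated in the text.
    """
    n = len(outputs)
    return [outputs[(i + sg) % n] for i in range(n)]


# Eight elements PEc0..PEc7, each driving its own index as data.
pe_outputs = list(range(8))
received = receive_by_distance(pe_outputs, 0b011)
```

With SG = 011, `received[0]` is the value driven by PEc3 and `received[1]` the value driven by PEc4, which matches the example given in the text.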
Fourth Embodiment

Referring to FIG. 7, an SIMD processor includes a
plurality of processing elements PEd0-PEdn. Processing
element PEd0 includes a selector S30, a local memory
LMc0, and local buses LB0-LB2. Each of the other
processing elements PEd1-PEdn includes similar components.
For the sake of simplification, an overall control unit, a
control bus, a global bus, a data input/output unit, an ALU
block, an MPY block, a BMU block and an AU block are not
illustrated in FIG. 4. They have a structure and operation
similar to those shown in FIG. 1, and their description will
not be repeated.

Selector S30 receives external inputs EX and EX0, and
the data of local bus LB2. Selector S30 selects an input data
and provides the same to local memory LMc0. Local
memory LMc0 stores the input data, and provides the stored
data to local buses LB0 and LB1, and to selector S31.
Similarly, the data of a local memory is sequentially output
to a local memory of a subsequent stage. More specifically,
data read out from a local memory is provided to local buses
LB0 and LB1, from which one is provided to the local
memory of a right-positioned processing element. The data
to be written includes data LD0-LDn-1 output from the
local memory of a left processing element, external inputs
EX0-EXn, and local bus LB2. Selectors S30-S3n select
one of these three data. An input to the leftmost processing
element from an adjacent processing element is an external
input EX. External inputs EX0-EXn connected to the
respective processing elements are individual external inputs
inherent to each processing element.

Local processing in filtering will be described hereinafter
with reference to FIG. 8. A 3x3 local filtering process is
applied as set forth in the following. Processing element
PEd0 always carries out processing on the bottom line of a
local window W. Processing element PEd1 carries out
processing on the second horizontal line from the bottom in
local window W. Processing element PEd2 carries out processing
on the third horizontal line from the bottom in local window W.
The three processing elements always apply processing on pixels
of identical position in the horizontal direction. Image data
LDi obtained by raster-scanning an image P is sequentially
input as external input EX of processing element PEd0 to be
written into local memory LMc0. When data transfer of one
line in the horizontal direction of image P is completed, data
transfer of the next line is initiated together with the start of
a process on an already written line.

Image data of the same horizontal position of every
differing line is read out from each local memory of all the
processing elements. The readout image data becomes the
data to be written into the local memory of an adjacent
processing element, to be stored in the same address. By
carrying out the above-described operation, image data of
one horizontal line in each local memory is completely
transferred to an adjacent local memory when the processing
of one line is finished.
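
The chained line-memory behavior described above can be sketched in Python as follows. This is a simplified model under stated assumptions: the helper name `shift_lines` is hypothetical, and a whole line is moved in one step, whereas the text describes the transfer proceeding address by address while a line is processed.

```python
def shift_lines(line_memories, new_line):
    """Move each stored line to the next memory and load new_line into PEd0.

    line_memories[k] models the local memory of processing element PEdk.
    The line held by the last memory is dropped.
    """
    return [list(new_line)] + [list(m) for m in line_memories[:-1]]


# Three processing elements, 4-pixel-wide image, memories initially zero.
memories = [[0] * 4 for _ in range(3)]
image = [[10, 11, 12, 13], [20, 21, 22, 23], [30, 31, 32, 33]]
for line in image:  # raster-scan order, top line first
    memories = shift_lines(memories, line)

# One column of the 3x3 window: PEd0 holds the bottom line, PEd2 the top line.
x = 1
column = [memories[k][x] for k in range(3)]
```

After three lines have been fed in, every element can read the same horizontal position `x` from its own memory, which is the access pattern the 3x3 filter relies on.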

When image P is divided into a plurality of regions and a
processing element carries out processing on each divided
region, image data is selected and provided to a local
memory via each individual external input.

In the local memory of the SIMD processor of the fourth
embodiment, the data read out from an adjacent local
memory can be selected as data to be written into a current
local memory. Therefore, local memories can be connected
in series in a chain-like manner. The local memory may
serve as a line memory effective for image processing. Since
each local memory allows writing of an individual external
input, high speed data transfer is realized, and parallel
operation is possible. Therefore, high speed processing is
allowed, and the processor can be used for a wide range of
applications since the above-described processes can be
carried out selectively.

Fifth Embodiment 

Referring to FIG. 9, an SIMD processor includes a
processing element PEe and a control bus CB. Processing
element PEe includes a local memory LMd and local buses
LB0-LB2. Local memory LMd includes selectors S41-S46,
a register R21, and bank memories Ba-Bc. For the sake of
simplification, an overall control unit, a global bus, an ALU
block, an MPY block, a BMU block, an AU block and a data
input/output unit are not illustrated in FIG. 9. The structure
and operation thereof are similar to those shown in FIG. 1,
so that their description will not be repeated. The number of
processing elements is arbitrary, although only one
processing element PEe is shown in the present embodiment.
Local memory LMd has a 3-bank structure in which three
bank memories Ba-Bc of the same capacity are arranged in
parallel. Each of bank memories Ba-Bc receives read and
write enable signals enableA-enableC via control bus CB
and addresses adrA-adrC. Two bank memories can be used
for reading, and one bank memory can be used for writing
at the same time. The output of each of bank memories
Ba-Bc is connected to local buses LB0 and LB1 via
selectors S45 and S46, respectively. An output of any of the bank
memories is provided to local buses LB0 and LB1 according
to address select signals selS0 and selS1 of control bus CB.
As write data, the data on local bus LB2 or the data applied
from outside processing element PEe is selected by selector
S41 in response to control signal selW.
Either the data on local bus LB2 stored in register
R21, or addresses adrA-adrC provided from the overall
control unit (not shown) via control bus CB, can be selected
by selectors S42-S44 as an address. In carrying out a table
look up process using an operation result, the bank memory
storing the look up table is already known. Therefore, by
providing a readout enable signal to that bank memory and
by providing an address select signal to select the operation
result stored in register R21 as an address, data is read out
from the same bank memory in all the processing elements.
Since the operation result stored in register R21 can be
selected as an address, a process can be carried out in which
the readout address differs in each processing element. The
operation result can be selected, not only as a readout
address, but also as a write address. Therefore, a read modify
operation can be carried out where the result of an operation
carried out on data of an address obtained by the operation
result is written again into the same address.
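
The per-element addressing described above can be illustrated with a small Python model. It is a sketch under stated assumptions: the class `ProcessingElement`, its method names, the bank size, and the table contents are all hypothetical, and the selectors and enable signals are reduced to ordinary method calls.

```python
class ProcessingElement:
    """Toy model of one element: three banks plus address register R21."""

    def __init__(self, bank_size):
        self.banks = {name: [0] * bank_size for name in ("Ba", "Bb", "Bc")}
        self.r21 = 0  # holds an operation result used as a local address

    def read_indexed(self, bank):
        # Same bank in every element; the address comes from this element's R21.
        return self.banks[bank][self.r21]

    def read_modify_write(self, bank, op):
        # Read at the R21 address, apply op, write back to the same address.
        self.banks[bank][self.r21] = op(self.banks[bank][self.r21])


# Table look up: every element holds the same table (squares) in bank Ba.
pes = [ProcessingElement(8) for _ in range(4)]
for pe in pes:
    pe.banks["Ba"] = [v * v for v in range(8)]
for pe, result in zip(pes, [3, 0, 7, 2]):
    pe.r21 = result  # per-element operation result
looked_up = [pe.read_indexed("Ba") for pe in pes]

# Read modify operation: increment the entry of bank Bb addressed by R21.
for pe in pes:
    pe.read_modify_write("Bb", lambda v: v + 1)
```

All elements execute the same two steps, yet each touches a different address, which is the point of routing register R21 to the address selectors.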

In the local memory of the SIMD processor of the fifth
embodiment, three individually controllable bank memories
are provided. Two of the three bank memories can be used
for a readout operation, and the remaining one bank memory
can be used for a writing operation simultaneously.
Therefore, high speed processing is allowed. Since a register
R21 storing the operation result is provided, the operation
result stored in register R21 can be selected as an address,
whereby individual addressing is allowed in the processing
elements operating in parallel under the same control signal.
Therefore, processing of a high level can be realized.

Sixth Embodiment

FIG. 10 is a block diagram showing the structure of an
SIMD processor of the sixth embodiment. The SIMD
processor of FIG. 10 differs from the SIMD processor of FIG.
1 in that a PE operation control unit POCa is additionally
provided, and control signals cALU, cMPY, cBMU, and
cAU are provided from respective operation unit blocks. The
remaining elements are similar to those of FIG. 1, and
corresponding components have the same reference
characters denoted.

An overall control unit CUb stores an instruction train that
is a mnemonic code converted into a binary code, and
provides a control signal via control bus CB so that
processing elements PEf0-PEfn are sequentially executed. PE
operation control unit POCa receives a 6-bit operation code
opcode in an instruction via control bus CB.

The PE operation control unit of FIG. 10 will be described
in detail with reference to FIG. 11.

Referring to FIG. 11, a PE operation control unit POCa
includes a decode unit DU for decoding an operation code
opcode, selectors S51-S59 for transferring control signals
cALU, cMPY, cBMU, and cAU for respective operation
units, and registers R31-R40.

The number of stages of shift registers differs for each
operation unit. The number of registers for an arithmetic
logic unit, a multiplier, a bit operator, and an accumulator is
one stage, two stages, three stages, and four stages,
respectively. Selectors S51-S59 inserted between the registers
select either an input from a register of a preceding stage or
a control signal from decode unit DU by pipeline delay
signals p1-p3 from decode unit DU, and provide the same
to a register of a succeeding stage.

The operation of the above PE operation control unit will
be described in detail. FIG. 12 shows the relationship
between an instruction and control in the SIMD processor of
FIG. 11. For example, when a sum of squared difference
indicated by a mnemonic code of subsra shown in FIG. 12
is to be carried out, a pipeline process is required in which
an operation result is sequentially sent to all the operation
units over the four stages to obtain the final result. Here, an
operation code opcode=000011 is applied to decode unit DU
from overall control unit CUb. Decode unit DU issues a sub
instruction to the arithmetic logic unit for obtaining a
difference. In order to carry out multiplication at a stage
subsequent to the process of the sub instruction, a pipeline
delay signal p1 to the multiplier is set to 10 so as to insert
a mpy instruction to a register of a succeeding stage to
control an arithmetic logic unit. Also, an ars instruction is
issued towards a bit operator to execute an arithmetic right
shift operation for arranging the digit figures in
accumulating the multiplied results. Pipeline delay signal p2 is set to
100 to insert an ars instruction to a register of a succeeding
stage to control the multiplier. Then, an add instruction is
issued to an accumulator to accumulate the shifted results.
Pipeline delay signal p3 is set to 1000 to insert an add
instruction in a register of a succeeding stage to control the
bit operator.

Issuance of an instruction and generation of a pipeline
delay signal are carried out at the same time. An instruction
is inserted into a predetermined stage. As a result, a control
signal is applied so as to carry out a sub instruction with
respect to an arithmetic logic unit at the next stage where an
instruction is inserted. Also, an instruction included in the
second stage and et seq. of the operation of a composite
instruction prior to the above subsra instruction or a control
signal indicating no operation (nop) is applied to other
operation units. The shift register shifts the applied control
signal by one stage. Here, insertion of a control signal into
the shift register with respect to an applied instruction is
carried out similar to the above subsra instruction.

At the next stage, a control signal for carrying out a mpy
instruction with respect to the multiplier is output. A control
signal corresponding to an instruction included in the third
stage and et seq. of an operation of a composite instruction
prior to the above subsra instruction, a control signal
corresponding to the instruction of the first stage of the
instruction applied subsequent to the above subsra instruction, or a
control signal corresponding to nop is provided to other
operation units. The shift register shifts the control signal by
one stage. At the same time, insertion of a control signal to
the shift register with respect to the instruction applied at the
current stage is carried out similar to the above subsra
instruction.

By repeating the above process sequentially, a subsra
instruction is carried out over four stages from a sub
instruction with respect to the arithmetic logic unit.
Furthermore, a decoding operation of continuously applied
instructions subsequent to a subsra instruction and a
determination of the number of pipe stages are carried
out to sequentially execute the instructions without waiting
for completion of the subsra instruction.

In the SIMD processor of the sixth embodiment, a
structure is provided in which each processing element decodes
an operation code from the overall control unit using a
pipeline register for an instruction and also makes determination
of the position of inserting the decoded result into a pipeline
register. A control signal with respect to the operation
unit of a pipeline structure is generated internal of each
processing element. Therefore, the control signal output from the
overall control unit is exclusively the operation code, and the
number of control buses can be reduced.

Seventh Embodiment

Referring to FIG. 13, an SIMD processor includes an
overall control unit CUa, a control bus CB, a global bus GB,
and a plurality of processing elements PEg0-PEgn.

Each of processing elements PEg0-PEgn includes a local
memory LMa, a data input/output unit IOa, an ALU block
ALB, an MPY block MB, a BMU block BB, an AU block
AUB, and a PE operation control unit POCb.

PE operation control unit POCb includes selectors S61
and S62, a comparison determination unit CP, a PE
activation signal register PAR, and AND circuits G1-G4.

The SIMD processor of FIG. 13 is similar to that shown
in FIG. 1 except that a PE operation control unit POCb is
additionally provided. Corresponding components have the
same reference characters denoted, and their description will
not be repeated. ALU block ALB, MPY block MB, BMU
block BB, and AU block AUB generate and provide to PE
operation control unit POCb flags flagALU, flagMPY,
flagBMU, and flagAU, respectively, when the operation
result indicates an overflow, a negative value, or 0 in
response to the operation result.

PE operation control unit POCb receives control signals
f0-f3 from overall control unit CUa via control bus CB
which are provided to each operation unit and condition
determination code CDC. Condition determination code
CDC specifies an operation unit that provides a flag.
Condition determination code CDC is applied to comparison
determination unit CP where determination is made whether
a flag applied via selector S61 is a desired flag or not.
Comparison determination unit CP outputs 1 and 0 when the
flag of a selected operation unit and condition determination
code CDC from overall control unit CUa match or not
match, respectively. The output result is provided to PE
activation signal register PAR via selector S62. PE
activation signal register PAR maintains the value until selector
S62 is reset to 1 by a reset signal rst from overall control unit
CUb.
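
The conditional activation mechanism of the seventh embodiment (a flag comparison latched in register PAR, with the broadcast control signals then ANDed with the latched value as the following passage explains) can be sketched in Python. This is an illustrative model only: the class `PEOperationControl`, the string flag values, and the 1-bit control encoding in which 0 stands for nop are all assumptions, not details given in the text.

```python
class PEOperationControl:
    """Toy model of the compare-and-mask behaviour of unit POCb."""

    UNITS = ("ALU", "MPY", "BMU", "AU")

    def __init__(self):
        self.par = 1  # PE activation signal register, 1 = active

    def determine(self, flags, unit, wanted):
        # Latch 1 if the selected unit's flag matches the condition, else 0.
        self.par = 1 if flags[unit] == wanted else 0

    def reset(self):
        # Models the reset from the overall control unit.
        self.par = 1

    def gate(self, controls):
        # AND each broadcast control signal with PAR; 0 stands for nop here.
        return {u: controls[u] & self.par for u in self.UNITS}


# Hypothetical 1-bit encoding of broadcast control signals f0-f3.
broadcast = {"ALU": 1, "MPY": 1, "BMU": 0, "AU": 1}
pe_match, pe_miss = PEOperationControl(), PEOperationControl()
pe_match.determine({"ALU": "negative"}, "ALU", "negative")
pe_miss.determine({"ALU": "zero"}, "ALU", "negative")
gated_match = pe_match.gate(broadcast)
gated_miss = pe_miss.gate(broadcast)
```

Both elements receive the same `broadcast` signals, but only the element whose flag matched passes them on; the other sees nop on every unit until `reset()` is called.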
The data stored in PE activation signal register PAR is
ANDed with control signals f0-f3 with respect to each
operation unit applied from overall control unit CUa to PE
operation control unit POCb by AND circuits G1-G4. The
results become control signals cALU, cMPY, cBMU, cAU
of each operation unit. Therefore, when condition
determination code CDC is applied and the determination result is
1, control signals f0-f3 from overall control unit CUa are
directly applied to respective operation units, whereby each
operation unit carries out a predetermined operation
according to the control signal. When the determination result is 0,
the instruction provided from overall control unit CUa is
masked, and a control signal corresponding to nop is
provided to respective operation units. This means that an
operation unit will not operate until the data stored in PE
activation signal register PAR is reset to 1 by overall control
unit CUa.

In the SIMD processor of the seventh embodiment,
respective operation units generate flags flagALU, flagMPY,
flagBMU, and flagAU corresponding to the operation result
and provide the generated flag to PE operation control unit
POCb. In PE operation control unit POCb, a flag of an
operation unit is compared according to condition
determination code CDC from overall control unit CUa to generate
a mask signal with respect to the succeeding instructions. It
is therefore possible to selectively render operative a
plurality of processing elements operating in parallel under the
same control signal according to the comparison result of
respective operation units. Thus, a variety of processes can
be executed in the SIMD processor of FIG. 7.

Eighth Embodiment

FIG. 14 is a block diagram showing a structure of an
SIMD processor according to an eighth embodiment of the
present invention. The SIMD processor of FIG. 14 differs
from that of FIG. 13 in that a pipeline operation control unit
PLC is provided for providing pipeline delay signals p1-p3
similar to those generated by a decode unit DU shown in
FIG. 11 within overall control unit CUb. Furthermore, a
pipeline register PLR is additionally provided in PE
operation control unit POCc. The remaining components are
similar to those of the SIMD processor of FIG. 13, and corresponding
components have the same reference characters denoted.

Pipeline register PLR is inserted between PE activation
signal register PAR and AND circuits G1-G4. PE operation
control unit POCc delays data provided from PE activation
signal register PAR according to pipeline delay signals
p1-p3 through a pipeline by pipeline register PLR. AND
circuits G1-G4 take the logical product of the
pipeline-delayed data and respective control signals f0-f3 provided
from overall control unit CUb. Control signals cALU,
cMPY, cBMU, and cAU are provided to respective
operation units. Pipeline operation control unit PLC includes
instruction pipeline registers similar to those shown in FIG.
11. Each instruction is subject to a pipeline-delay
corresponding to a composite instruction to be provided to all
processing elements PEh0-PEhn.

The pipeline register of FIG. 14 will be described in detail
with reference to FIG. 15.

Referring to FIG. 15, pipeline register PLR includes
registers R51-R60 and selectors S81-S89.

Although pipeline register PLR has a structure similar to
the shift register portion of a control signal of the PE
operation control unit of FIG. 11, it is to be noted that the bit
width of registers R51-R60 is 1 bit and the inserted signal
is a PE activation signal enablePE output from PE activation
signal register PAR storing the same condition
determination result. The manner of inserting PE activation signal
enablePE is identical to the manner of inserting a control
signal in the PE operation control unit of FIG. 11.

The operation of the pipeline register of FIG. 15 will be
described in detail with reference to FIG. 16.

Referring to FIG. 16 corresponding to a pipeline register
of one arbitrary processing element, the operation up to time
t3 includes decoding of a subsra instruction by overall
control unit CUb, delay of a control signal towards
respective operation units by an operation control signal pipeline
register in pipeline operation control unit PLC, and
application of the delayed signal to each of processing elements
PEh0-PEhn. At time t4, an instruction for making
determination of a flag of any of the operation units is applied, and
0 is stored in PE activation signal register PAR indicating
that the determination result does not match condition
determination code CDC from overall control unit CUb.

In pipeline register PLR, 1 is stored in all the registers for
continuously operating each operation unit until a condition
determination result is obtained. At a condition
determination instruction execution mode, nop is inserted in pipeline
registers PLR for all operation units.

It is assumed that a mac instruction (a sum of products)
succeeds the condition determination code. Pipe delay
signals are smaller than a subsra instruction respectively by 1
in value, i.e. p1=01, p2=010, p3=0100. Therefore, in
pipeline register PLR, the nop inserted at the final stage of
the pipeline register which is the immediate preceding
instruction, i.e. a condition determination code, will not be
transmitted to the next stage, and a control signal
corresponding to mpy, ars, and add instructions required for the
mac instruction are inserted according to the pipeline delay
value. Similarly, in pipeline register PLR, 1 inserted in the
final stage of the pipeline register will not be transmitted to
a succeeding stage, and 0 which is the condition
determination result is inserted according to pipeline delay values
p1-p3.

At the time of inserting PE control signal enablePE to a
pipe with respect to the mac instruction of time t5, control
signals enableBMU and enableAU with respect to the bit
operator and the accumulator, respectively, remain 1
corresponding to a subsra instruction prior to execution of the
condition determination. This control signal is subject to
pipe-delay and then output. Although PE activation signal
enablePE is already 0, the subsra instruction prior to
execution of the condition determination is carried out until the
final stage. 0 is inserted in pipeline register PLR after the
mac instruction at time t5.

Although a processing element is described when the
condition determination result is 0, there is also the
possibility of a condition determination result of 1 in another
processing element. In this case, 1 is inserted in the pipeline
register PLR in that processing element and all the
succeeding instructions are executed.

At time t8, an instruction for setting PE activation signal
register PAR to 1 is applied. PE activation signal register
PAR is set to 1 in all processing elements, and 1 is inserted
into pipeline register PLR to have the succeeding
instructions executed by all the processing elements. In FIG. 16, the
subsra instruction subsequent to time t9 is sequentially
executed in a manner similar to that carried out from time t0
to t3, where insertion is carried out with pipeline register
PLR and PE activation signal register PAR.

In the SIMD processor of the eighth embodiment, PE
activation signal enablePE is inserted into pipeline register
PLR for respective operation units according to pipeline
delay signals p1-p3. Control signals cALU, cMPY, cBMU
and cAU provided from PE operation control unit POCc are
obtained as a result of control signals enableALU,
enableMPY, enableBMU, enableAU, which serve as an
operation unit activation signal, being ANDed with control
signals f0-f3 prior to being provided to respective operation
units with a pipe delay identical to that of control signals
f0-f3 provided from overall control unit CUb. When a
processing element is to be selectively activated upon
condition determination, it is not necessary to wait for the
completion of a preceding composite instruction to issue a
condition determination instruction. It is also not necessary
to insert a nop in the instruction train. Therefore, instruction
description is facilitated and a condition determination
instruction can be described at an arbitrary position. The
number of instruction steps can be reduced to realize high
speed processing. Furthermore, since a pipeline process is
carried out with one overall control unit CUb, the circuit
complexity of each of processing elements PEh0-PEhn can
be reduced.

Ninth Embodiment

Referring to FIG. 17, an SIMD processor includes an
overall control unit CUa, a control bus CB, a global bus GB
(GP0-GPn), processing elements PEa0-PEan, and a link
processing unit LOUa. Link processing unit LOUa includes a
sequence unit SEQ, an interface unit IFa, selectors S91 and
S92, a register R71, an arithmetic logic unit ALa, and a local
memory LML. The SIMD processor of FIG. 17 differs from
that of FIG. 1 in that a link processing unit LOUa is
additionally provided. The remaining components are
similar to those of the SIMD processor of FIG. 1, and
corresponding components have the same reference characters
denoted.

Link processing unit LOUa is connected to all global
buses GP0-GPn and control bus CB. Link processing unit
LOUa includes a sequence unit SEQ for controlling the
processing sequence in link processing unit LOUa according
to a control signal from overall control unit CUa, an
interface unit IFa with global bus GB, an arithmetic logic unit
ALa for carrying out addition, maximum value minimum
value operation, a register R71 and a local memory LML for
storing an output of arithmetic logic unit ALa, and selectors
S91 and S92.

When a control signal is applied to sequence unit SEQ
from overall control unit CUa via control bus CB to take the
total sum of data output from processing elements
PEa0-PEan to global bus GB, sequence control unit SEQ
generates a control signal with respect to each element in
link processing unit to carry out the following operations.

First, interface unit IFa provides output data from all
processing elements output at the same time in parallel to
link processing unit LOUa via global buses GP0-GPn.
Then data of global buses GP0-GPn are sequentially
applied to arithmetic logic unit ALa via selector S91,
whereby arithmetic logic unit ALa accumulates sequentially
input data using register R71.

When the maximum value/minimum value is to be
extracted from the outputs of all the processing elements, all
the outputs of the processing elements are provided to link
processing unit LOUa similar to the above accumulation
process. The data are sequentially applied to arithmetic logic
unit ALa. The current maximum value/minimum value is
stored in register R71. By comparing the stored maximum
value/minimum value with the next input, the maximum
value/minimum value output from all the processing
elements can be extracted.

Using the maximum value minimum value detection
function, the outputs of the processing elements can be
sorted in the descending order/ascending order. FIG. 18 is a
diagram for describing a sorting process of extracting the
three greatest data of the outputs of all the processing
elements.

The three greatest data are stored in local memory LML.
The largest, the second largest, and the third largest value are
stored in address=0, address=1 and address=2, respectively.
Data other than the three highest level are discarded. The
smallest possible value of the data to be sorted is stored in
each address of local memory LML.

When all the outputs of processing elements are provided
to link processing unit LOUa, a maximum value detection
operation is carried out between the value on global bus GP0
and the value stored in address=0 of local memory LML.
The greater value is written into address=0 of local memory
LML, and the smaller value is stored in register R71. Next,
a maximum value detection operation is carried out in a
similar manner between the data in address=1 of local
memory LML and the lower value of the prior maximum
value detection operation stored in register R71. The greater
value thereof is written into address=1 of local memory
LML, and the smaller value is stored in register R71.
Similarly, a maximum value detection operation is carried
out between the data in address=2 of local memory
LML and the data in register R71.

According to the above process, a maximum value
detection operation and data exchange are carried out three times
between the output of one processing element and the three
highest values stored in local memory LML. By repeating the
above process from global bus GP0 to GPn, the three
greatest data out of the outputs of n+1 processing elements
can be derived. By carrying out a similar process with respect
to the output of the next processing following the output
of the current processing, the three highest data
out of 2(n+1) data can be derived.

The outputs of the processing elements can be sorted in an
ascending order by carrying out a minimum value detection
operation instead of the above-described maximum
detection operation.

In the SIMD processor of the ninth embodiment, a link
processing unit LOUa connected to a global bus includes an
arithmetic logic unit ALa and a local memory LML allowing
addition and maximum value minimum value detection
operations. Since accumulation and sorting of the outputs of
the processing elements can be carried out without having to
exchange data between the processing elements, the speed of
processing due to parallel arrangement and the integration
function of data of the parallel processing elements can
be improved.

Tenth Embodiment

FIG. 19 is a block diagram showing a structure of an
SIMD processor according to a tenth embodiment of the
present invention. The SIMD processor of FIG. 19 differs
from the SIMD processor of FIG. 17 in that interface unit
IFa is modified into an interface unit IFb having a plurality of
outputs. The remaining components are similar, and have the
same reference characters denoted.

Interface unit IFb includes four outputs out0-out3 that
are provided from link processing unit LOUb. The data of
each of outputs out0-out3 has a bit width of 16 bits.

Interface unit IFb will be described in detail with
reference to FIG. 20.

Referring to FIG. 20, interface unit IFb includes registers
R81-R88, and selectors S101-S110. Interface unit IFb
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receives the outputs of all the processing elements by eight 
16-bit registers R81-R88 to carry out an operation on ail 
output data of processing elements. The entered output is 
provided to arithmetic logic unit ALu via 8-input selector 
Silt and selector S91. An operation result aluOUT in link j 
processing unit LOUb is provided as output outO via 2-input 
selector 109. 

When operation of data between processing elements is
not required, the data is divided into upper and lower 8 bits
to be provided via 4-input selectors S101-S108 connected to
four output portions. The input route of data towards 4-input
selectors S101-S108 is as shown in FIG. 20. For example,
GP7 <15:8> implies the upper 8 bits of data on global bus
GP7, and GP7 <7:0> implies the lower 8 bits of data on
global bus GP7.

As a first output manner where data of processing elements
PEa0-PEa3 are to be output, the bottom input of the four
inputs is selected in all 4-input selectors S101-S108.
Therefore, the outputs of processing elements PEa0, PEa1,
PEa2, and PEa3 are provided as out0, out1, out2, and out3,
respectively.

As a second output manner where data of processing
elements PEa4-PEa7 are to be output, the second input from
the bottom of the four inputs is selected in all 4-input selectors
S101-S108. The outputs of processing elements PEa4,
PEa5, PEa6, and PEa7 are provided as out0, out1, out2, and
out3, respectively.

As a third output manner where the lower 8 bits of the
output data of all processing elements are to be output, the
third input from the bottom of the four inputs is selected in all
4-input selectors S101-S108. The lower 8 bits of processing
elements PEa0, PEa1, PEa2, and PEa3 are provided to the
lower 8 bits of output out0, the upper 8 bits of output out0,
the lower 8 bits of output out1, and the upper 8 bits of
output out1, respectively. The following processing elements
PEa4-PEa7 are similarly operated to provide the
outputs as out2 and out3.

As a fourth output manner where the upper 8 bits of the output
data of all the processing elements are to be output, the top
input of the four inputs is selected in all 4-input selectors
S101-S108. The upper 8 bits of processing elements PEa0,
PEa1, PEa2, and PEa3 are provided to the lower 8 bits of
output out0, the upper 8 bits of output out0, the lower 8 bits
of output out1, and the upper 8 bits of output out1,
respectively. The following processing elements
PEa4-PEa7 are similarly operated to provide outputs as out2
and out3.
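
The four output manners can be summarized with a short software
model. The following Python sketch is illustrative only and is not
part of the disclosed circuit: the function name ifb_select and the
mode numbering 1 to 4 are assumptions, and the byte placement for
PEa4-PEa7 in the third and fourth manners is assumed to repeat the
pattern stated for PEa0-PEa3.

```python
def ifb_select(pe_out, mode):
    """Return [out0, out1, out2, out3] for one of the four output manners.

    pe_out: eight 16-bit words, one per processing element PEa0..PEa7.
    mode:   1..4, matching the first to fourth output manners in the text
            (the numbering itself is an assumption of this sketch).
    """
    if len(pe_out) != 8:
        raise ValueError("expected outputs of eight processing elements")
    words = [w & 0xFFFF for w in pe_out]
    if mode == 1:            # PEa0-PEa3 passed through unchanged
        return words[0:4]
    if mode == 2:            # PEa4-PEa7 passed through unchanged
        return words[4:8]
    if mode == 3:            # lower 8 bits of every processing element
        parts = [w & 0xFF for w in words]
    elif mode == 4:          # upper 8 bits of every processing element
        parts = [(w >> 8) & 0xFF for w in words]
    else:
        raise ValueError("mode must be 1, 2, 3 or 4")
    # Even-numbered element fills the lower byte, odd-numbered the upper byte.
    return [parts[2 * i] | (parts[2 * i + 1] << 8) for i in range(4)]
```

In this model the third and fourth manners pack two 8-bit fields into
each 16-bit output, which is how eight processing elements can share
four output lines.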

In the SIMD processor of the tenth embodiment where 8
parallel processing elements are provided, the outputs of the
processing elements are provided via link processing unit
LOUb, whereby the output of each processing element is
divided into upper data and lower data to be selectively
output according to the output of link processing unit LOUb
as 4 outputs. Therefore, the outputs of the processing elements
can be provided to the outside world in various output
modes. Output is enabled with a reduced number of external
output lines with respect to the 8 parallel outputs.



Referring to FIG. 21, an SIMD processor of an eleventh
embodiment includes selectors S93-S96, registers R72 and
R73, a local memory LMLb, and an incrementer ADU, in
addition to the components provided in the SIMD processor
of FIG. 17. Corresponding components of the SIMD processors
of FIGS. 17 and 21 have the same reference characters
denoted, and their description will not be repeated.

Incrementer ADU increments its value by 1 every time
data is applied from interface unit IFa to an arithmetic logic
unit ALb for a sorting operation. The incremented value is
provided to selector S93 of a succeeding stage for the
purpose of data exchange with local memory LMLb.
According to the above operation, an inherent code can be
allocated to an output of a processing element provided via
interface unit IFa. Addressing of local memory LMLb and
control with respect to the two 2-input selectors S93 and S94
of the first stage are similar to those of local memory LMLa
and selectors S91 and S92 connected to arithmetic logic unit
ALb.

The two 2-input selectors S95 and S96 of the second stage
are controlled according to the operation result of arithmetic
logic unit ALb to carry out data exchange similar to that of
arithmetic logic unit ALb. Register R72 and local memory
LMLb connected to 2-input selectors S95 and S96 store the
inherent code allocated to each data in parallel with the data
exchange of the outputs of the processing elements. Therefore,
it is easy to identify which processing element output the upper
or lower ranked data, extracted as a result of sorting the data
applied to link processing unit LOUc from parallel processing
elements PEa0-PEan, comes from. It is also possible to
identify the processing element and the output order of the
extracted data by repeating data input to link processing
unit LOUc.

In the SIMD processor of the eleventh embodiment,
sorting outputs of processing elements in a link processing
unit LOUc can be carried out by allocating a code with
respect to an output of each processing element. Therefore,
a processing element per se does not have to generate a code
in contrast to a process such as a vector matching process
where identification is required by allocating a code to the
data. Accordingly, the circuit complexity of the processing



speeded since code allocation and sorting are executed
in parallel with operation in a processing element.

Arbitrary combinations of the above structures of the first
to eleventh embodiments are allowed. In such a case, an
effect similar to that described in each embodiment can be
obtained.
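
The code allocation and sorting described for the eleventh
embodiment can be pictured with a small software model. The Python
sketch below is a hedged illustration, not the disclosed circuit: the
function name sort_with_codes, the parameter keep, and the starting
code value of 0 are assumptions. The counter stands in for incrementer
ADU, and each stored (value, code) pair stands in for a data entry and
its code that are exchanged together under control of the comparison
result.

```python
def sort_with_codes(pe_outputs, keep=2):
    """Keep the `keep` largest values, each paired with its source code.

    pe_outputs: values arriving one at a time from the interface unit.
    keep:       number of top-ranked entries to retain (assumed parameter).
    """
    best = []   # (value, code) pairs in descending order of value
    code = 0    # models the incrementer; starting at 0 is an assumption
    for value in pe_outputs:
        cand = (value, code)
        for i in range(len(best)):
            # The comparison result drives the exchange of data and code
            # together, so a value never loses track of its origin.
            if cand[0] > best[i][0]:
                best[i], cand = cand, best[i]
        if len(best) < keep:
            best.append(cand)
        code += 1   # one increment per applied data
    return best
```

Because the code travels with the value through every exchange, the
source processing element of each extracted value can be read off
directly from the result.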



Although the present invention has been described and
illustrated in detail, it is clearly understood that the same is
by way of illustration and example only and is not to be
taken by way of limitation, the spirit and scope of the present
invention being limited only by the terms of the appended
claims.

What is claimed is: 
1. A SIMD processor comprising: 
overall control means; 
a plurality of processing el 
a global bus for com 



with each of said plurality of processing elements, 

each of said plurality of processing elements comprises
a local memory,
a plurality of operation means,
data input/output means,
a local bus connected to said local memory, said
plurality of operation means, and said data input/
output means for transferring data, and
a composite operation bus connected to each of said
plurality of operation means for transferring data
to carry out a composite operation,
said overall control means controls each operation of 

said plurality of processing elements so as to carry 

out the ss
said local bus comprises

two data input local buses for entering data of said
plurality of operation means, and

one data output local bus for providing data from
said plurality of operation means,

said plurality of operation means comprises

arithmetic logic operation means including an arithmetic
logic unit,

multiply means including a multiplier,

bit operation means including a bit operator, and

accumulation means including an accumulator,
and

said composite operation bus comprises

a first composite operation bus for providing an
output data of said arithmetic logic unit to said
multiplier, said bit operator, and said accumulator,

a second composite operation bus for providing an
output data of said multiplier to said bit operator

a third composite operation bus for providing an
output data of said bit operator to said accumulator.

2. The SIMD processor according to claim 1, wherein said
arithmetic logic operation means further comprises

first and second arithmetic logic registers for storing data
provided from said two data input local buses,

a third arithmetic logic register for storing data provided
from said arithmetic logic unit, and

an arithmetic logic selector for selectively providing data
output from said third arithmetic logic register to one of
said two data input local buses and one data output
local bus,

wherein said arithmetic logic unit carries out an arithmetic
logic operation process on data output from said first
and second arithmetic logic registers,

wherein said multiply means further comprises

a first multiplication selector for receiving data output
from one of said two data input local buses and data
output from said first composite operation bus and
providing one of the two received data,

a second multiplication selector for selecting and providing
one of data output from the other of said two data
input local buses and data output from said first composite
operation bus,

a first multiplication register for storing data output from
said first multiplication selector,

a second multiplication register for storing data provided
from said second multiplication selector,

a third multiplication register for storing data provided
from said multiplier, and

a third multiplication selector for selectively providing
data output from said third multiplication register to
one of said two data input local buses and one data
output local bus,

wherein said multiplier multiplies data output from said
first and second multiplication registers and provides
the multiplied data to said third multiplication register,

wherein said bit operation means further comprises

a first bit operation selector for selectively providing one
of data output from one of said two data input local
buses and data output from said second composite
operation bus,

a second bit operation selector for selectively providing
one of data provided from the other of said two data
input local buses and data output from said first composite
operation bus,

a first bit operation register for storing data provided from
said first bit operation selector,

a second bit operation register for storing data provided
from said second bit operation selector,

a third bit operation register for storing data provided
from said bit operator, and

a third bit operation selector for providing data output
from said third bit operation register selectively to one
of said two data input local buses and one data output
local bus,

wherein said bit operator carries out a bit operation
process on data output from said first and second bit
operation registers,
wherein said accumulation means further comprises

a first accumulation selector for selectively providing one
data out of data output from said first to third composite
operation buses,

a first accumulation register for storing data provided
from said first accumulation selector,

a second accumulation register for storing data provided
from said accumulator, and

a second accumulation selector for providing data output
from said second accumulation register selectively to
one of said two data input local buses and one data
output local bus,

wherein said accumulator carries out an accumulation
operation using data output from said first and second
accumulation registers.



